Let’s practice bringing up a very simple little map of the US using
maps and ggplot2. The maps
package provides latitude and longitude data for various geographic regions.
The vignette for the package can be found here if
you’d like more information on the package. Although we will do something
much more complicated for our Project 2, for this little exploration
let’s just keep it simple by taking advantage of everything that is
pre-loaded in maps.
First, we need to extract state-level geographic information using
the map_data() function. Similar functions exist for other
geographic levels, including world (country) and county, but we will
be using state.
states <- map_data("state")
| long | lat | group | order | region | subregion |
|---|---|---|---|---|---|
| -87.46201 | 30.38968 | 1 | 1 | alabama | NA |
| -87.48493 | 30.37249 | 1 | 2 | alabama | NA |
| -87.52503 | 30.37249 | 1 | 3 | alabama | NA |
| -87.53076 | 30.33239 | 1 | 4 | alabama | NA |
| -87.57087 | 30.32665 | 1 | 5 | alabama | NA |
| -87.58806 | 30.32665 | 1 | 6 | alabama | NA |
The states variables are:
long = longitude in unprojected
geographic coordinates (i.e., longitude and latitude on a sphere or
ellipsoid like WGS84); ranges from -180 to 180
lat = latitude in unprojected
geographic coordinates (i.e., longitude and latitude on a sphere or
ellipsoid like WGS84); ranges from -90 to 90
group = grouping variable for polygon boundaries,
where all points in a given group represent a single polygon, like parts
of a country, state, or county. E.g., here Alabama is treated as a single
group (labeled 1 because it’s the first state alphabetically)
order = indicates the order in which points should
be connected to draw the polygon; helps when connecting
long and lat coordinates to ensure the shape
is drawn in the correct sequence
region = represents the primary geographic unit
requested; e.g., here we requested “state”, so region takes the
state names as labels
subregion = refers to a subdivision within a region,
but is only present for detailed maps like “county”, not “state”; hence
we have no subregion here
Now, let’s make a base map leveraging the ggplot2
package using the states dataset. Notice that
x will always be longitude, y will always be
latitude, and that we must also pass it a grouping variable. Why do you
think we need the group variable here?
The group variable defines how points should be connected to form shapes. Without it, ggplot would connect points across different states, producing a map that makes no visual sense.
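A minimal base-map sketch using the states dataset (the fill and border colors below are illustrative assumptions, not required choices):

```r
## Base map of the continental US using the `states` polygons
## (fill/color choices here are illustrative assumptions)
library(ggplot2)

ggplot(states, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "white", color = "black") +
  coord_quickmap()  # keeps a reasonable aspect ratio for unprojected coordinates
```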
Great, we’ve made our first map of the continental US! It’s pretty uninteresting just as-is, so let’s use this map to overlay our pneumonia-related hospital readmissions data.
To perform spatial analysis, one needs geographic-level aggregate data that has been joined with geometry. Although it might feel like we technically have everything we need, alas, we do not quite. Remember that our ultimate goal for our stakeholder is prediction, so we need to determine whether spatial structure is a problem before we proceed with predictions. If it is, we can include spatially lagged variables in our analysis; if not, we can proceed with exactly what we pre-processed in Project 1.
But because our ultimate goal is to try to tie this back with all of
the work we did in Project 1, I will actually give you data, which you
get by running this code chunk. This returns both the training and
testing sets, aggregated but not yet merged with geometries, so we don’t
lose our other encoding and especially our transformations for OLS. They
are going to be called stateAggTrain and
stateAggTest, respectively.
source(file = "loadTrainTest.R")
NOTE 1 that these new aggregated training and
testing sets have region in them instead of
state. This is so that they are already cleaned up for
merging with geometries.
NOTE 2: All of the raw variables have been named
Median_Raw... to distinguish them from their transformed
and scaled counterparts!
Check that all of the states (in region) made it into
the training set. If you choose to do this by counting, say with
unique() or distinct(), you should get
52 if they are all present.
# Count the states
length(unique(stateAggTrain$region))
## [1] 52
We indeed have 52 regions in the training data.
Why am I not asking you to check the states in the testing data?
Due to the test data being a smaller subset of the original data, it may not include all of the states.
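If you want to see exactly which states (if any) are missing from the test set, one quick check, assuming both aggregated sets are loaded, is:

```r
## States in the training set that never appear in the testing set;
## returns character(0) if none are missing
setdiff(unique(stateAggTrain$region), unique(stateAggTest$region))
```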
In the aggregation that I did for both stateAggTrain and
stateAggTest I took the median for
everything. Why, when we just went through all the rigmarole of figuring
out if we should display the mean or the median of variables like
Predicted Pneumonia Readmissions?
HINT 1: What have we already done to the training and testing sets that we had not done here because we went back to the raw data?
HINT 2: Why is a median or a mean acceptable if we’re working with scaled and centered data?
Since we are now working with our transformed training and testing data, the distributions have been artificially changed. Data that has been transformed, scaled, and centered will be roughly symmetrical, meaning that either the median or the mean will be acceptable.
Here, we have selected state. We already decided this going into the previous analysis; it only makes sense to continue that level of spatial analysis first.
Sadly, what we used for mapping will not be sufficient. Recall that, in lecture, I kept referring to something called shapefiles and said we wouldn’t have to worry about them much (yet), but that we would need them soon. Well, here they are!
A shapefile is a widely used geospatial vector data format for
geographic information system (GIS) software, storing the location,
shape, and attributes of geographic features such as points, lines, and
polygons. It is typically stored across a set of related files with file
extensions of .shp, .shx, or
.dbf. Because shapefiles are the
gold standard of spatial analysis, our spatial statistics
packages in R similarly expect geometries in simple features
(abbreviated sf) format. We could either
import shapefiles (which we will do in Project 2) or, here, we will
simply convert the geometries from the maps package into
sf format so we can calculate our spatial statistics.
Complicated, right? Thankfully, most of that magic will be performed
for us using the sf package in R. If you want to learn more
about the sf package, you can find that here.
We can convert to shapefile format in a single line of code:
## Get state map geometries in shapefile format
states_sf <- st_as_sf(map("state", plot = FALSE, fill = TRUE))
You may need to do a little digging, but what exactly is the
st_as_sf() function from the sf package doing?
Can you figure out why I didn’t just use the states object
we already used for mapping?
# Pull up documentation
?st_as_sf
The st_as_sf() function converts data objects with spatial information into sf objects. In our case, the code takes the state data from the maps package and converts it into an sf object.
The sf format might be preferred, as it stores the spatial data in a single geometry column. This makes data transformations, analyses, and visualizations more straightforward and flexible.
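To see what the conversion produced, you can inspect the object; you should find something that is both an sf object and a data frame, with one row per state and a single geometry column (a quick sketch assuming states_sf from above):

```r
## Inspect the converted object
class(states_sf)   # should include "sf" and "data.frame"
head(states_sf)    # one row per state: ID plus a geometry column
```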
Now, we finally get to the meat of it! Spatial weights represent the spatial relationships (or influence) between geographic units (here, states). In other words, spatial weights define how much nearby observations contribute to calculations for a given location. These weights are typically based on proximity, contiguity (shared borders), or distance, but it depends on the type of spatial autocorrelation or dependence in the data.
Recall that, in lecture, I mentioned that we were going to look at spatial autocorrelation. Spatial autocorrelation is the relative degree to which the value of a variable (e.g., hospital readmissions) at one location is similar to values at nearby locations. When nearby areas have similar values (e.g., high readmission rates clustered together), it’s called positive spatial autocorrelation (clustering); when nearby areas have very different values, it’s negative spatial autocorrelation (dispersal). Thus, you can think of spatial weights as measures of the spatial correlation in hospital readmissions between states, in our case!
We will use a package called spatial dependence (spdep,
more information here) to
calculate our spatial weights. We have a series of small sub-steps to go
through to get there.
By default, the sf package now uses Google’s S2 geometry
engine (a library for geometric calculations on the sphere developed by
Google for their maps), which is strict and throws errors when
geometries are even slightly invalid (like edges that cross each
other even slightly). We will choose to turn off the S2 engine just for
this step because it will fail otherwise. This is often going to be the
case when you are not importing shapefiles directly or if your
shapefiles are older and predate the S2 engine. But, we will turn it
back on when we’re done.
We can disable the S2 engine like this:
## Disable Google S2 engine
sf_use_s2(FALSE)
We will also just do a quick cleanup using the
st_make_valid function just to be safe. It
never hurts! All it does is clean up any of those cases where the
borders do cross (i.e., are invalid) so we don’t get any errors.
We can validate the geometry edges like this:
## Clean up the edges of `states_sf` to be safe
states_sf <- st_make_valid(states_sf)
Find Neighboring States
Next, we will use the poly2nb() function from the
spdep package to create a neighborhood list based
on the polygon boundaries of our spatial objects (states). It is using
the shapefile (sf) object we already made, states_sf, to do
this. What it does is define which polygons (states) are neighbors
by looking for which ones share borders with each other (called
contiguity). By default, it uses what is called “queen”
contiguity (derived from chess); it defines as a neighbor any state
that shares any point (edge OR corner).
As an example, Colorado and Arizona do not share a border, only a corner, right? But according to “queen” contiguity, they are still neighbors.
Look up “rook” contiguity and explain (1) what it means, (2) how
the neighbor relationship of Colorado and Arizona would change, if it
would, and (3) how we change to this type of contiguity in the
poly2nb() function. (You don’t need to actually change it
in the code, however.)
Rook contiguity means that states are only considered neighbors when they share an edge (a shared border segment), not just a corner.
Using rook contiguity, Colorado and Arizona would not be considered neighbors, as they do not share an edge.
In the poly2nb() function, you can change to rook contiguity by setting the queen parameter to FALSE.
We can find neighboring states using queen contiguity like this:
## Create neighbors using queen contiguity
states_nb <- poly2nb(states_sf, queen = TRUE)
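Before moving on, it can help to peek at the neighbor list itself (a sketch assuming states_nb from above; the exact counts depend on the map data):

```r
## Summary of the neighborhood structure: number of regions, number of
## links, and the distribution of neighbor counts
summary(states_nb)

## Indices of Alabama's neighbors (Alabama is the first polygon)
states_nb[[1]]
```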
Calculate Spatial Weights Between Neighbors
Next, we will use the nb2listw() function from
spdep to convert the neighborhood list that we created with
poly2nb() (called states_nb) into a spatial
weights object that we can use in spatial analysis like Moran’s \(I\), LISA, and spatial regression. What
nb2listw() does is assigns weights to the neighbors
identified in the nb object, and those weights define how
strongly each neighbor influences a spatial unit (state).
There are actually several styles of weighting; we will be using
row-standardizing (style = "W") as it is the most
appropriate for spatial autocorrelation. Row-standardized weights are
spatial weights where each row sums to 1. This means the influence of
all neighbors on a given spatial unit is normalized, so the total
influence is consistent across all units regardless of how many
neighbors they have. Without standardization, units with more neighbors
could have a greater total weight, biasing spatial statistics. This
means that states like Colorado, which have lots of neighbors, are not
weighted more heavily than states like Alaska or Hawaii, which are
effectively islands. This will make Moran’s \(I\) more comparable across the
states.
We can calculate the spatial weights as follows:
## Calculate the state-level spatial weights
states_sw <- nb2listw(states_nb, style = "W")
## Turn Google S2 engine back on
sf_use_s2(TRUE)
The first state in our dataset is Alabama. Take a look at one of our maps if needed. How many neighbors do you expect it to have?
I would expect Alabama to have four neighbors (as it shares a border with four states).
Now investigate Alabama’s spatial weights:
states_sw$weights[[1]]
## [1] 0.25 0.25 0.25 0.25
Does this feel consistent (1) with what you know about the number of neighbors Alabama has and (2) what I mentioned above about row-standardizing our weights? Why or why not?
This feels consistent, as I had previously identified four neighbors. Since the weights are standardized, all the weights will add up to 1. Therefore, four weights all equal to 0.25 is exactly what we should expect.
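If you want to see which states those four equal weights belong to, one way (assuming the states_sf and states_nb objects built above) is to index the ID column with Alabama’s neighbor indices:

```r
## Map Alabama's neighbor indices back to state names
states_sf$ID[states_nb[[1]]]
```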
Lastly, investigate the spatial weights for Massachusetts (number 20). Does it fit with your expectations given the number of neighbors?
# Massachusetts Weights
states_sw$weights[[20]]
## [1] 0.2 0.2 0.2 0.2 0.2
Once again, this meets expectations. There are five states bordering Massachusetts, and the five equal weights add up to 1.
Join the Spatial Weights back to our Training Data
Our final step is to attach the neighbor lists and spatial weights back to the training data and the shapefile we already have. This way, we can proceed with the rest of our spatial analyses.
NOTE that if the shapefile spatial names
(ID) did not already match region in
stateAggTrain we’d have to convert that first! Further, it
is at this stage you would add FIPS to facilitate merging, if
needed.
We can perform the join as follows:
## If we want to make sure that the shapefile ID matches region:
## unique(states_sf$ID)
## Join the shapefile with spatial weights and neighbor lists back to the
## training data
states_sf_AggTrain <- left_join(states_sf, stateAggTrain,
by = c("ID" = "region"))
I am sure you caught this, but spatial weights and everything are being done on the training data only. What key issue am I trying to avoid by performing all of my spatial pre-processing steps on the training data only? Why am I not using the full dataset for this?
HINT: It is the same reason we did all of our other pre-processing AFTER splitting the data!
We want to avoid data leakage, which could occur if we do our pre-processing on the full data.
Recall from lecture that I mentioned that our goal is to determine if there is clustering (positive spatial autocorrelation) or dispersal/dispersion (negative spatial autocorrelation). Global Moran’s \(I\) is a way to measure clustering and will tell us whether clustering exists overall. Moran’s \(I\) statistic ranges from -1 to +1, just like a typical correlation coefficient! Significant positive \(I\) indicates clustering of similar values (high with high, low with low) whereas significant negative \(I\) indicates dispersion (high values near low values; or “opposites attract”). Just as with a canonical correlation, \(p\)-values < 0.05 suggest significant spatial autocorrelation (assuming we are using the 5% significance threshold). It’s important to note that an \(I\) = 0 would indicate randomness; thus, the null hypothesis (\(H_0\)) is that median pneumonia readmission rates are randomly distributed across the states.
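For reference, the global Moran’s \(I\) statistic is commonly written as:

\[
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i} (y_i - \bar{y})^2}
\]

where \(n\) is the number of spatial units (states), \(w_{ij}\) are the spatial weights, and \(y_i\) is the value of the target in state \(i\). The numerator rewards pairs of neighbors that deviate from the mean in the same direction, which is why positive \(I\) corresponds to clustering.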
We are going to set the target like this to make it easier to swap out at a later question. NOTE that we are centering but NOT scaling it - why? This is because we want it to be comparable to the lagged values; this will make a little more sense as we go.
## Change this to change the target!
y <- scale(states_sf_AggTrain$Median_RawPredictedReadmissionRate,
center = TRUE, scale = FALSE)
Using the moran.test() function from the spatial
dependence (spdep) package, we will calculate the global
Moran’s \(I\) statistic first. If you
would like more details about this function, you can find that here.
## Calculate the global Moran's I for states on training data
## Use centered y & apply spatial weights
global_moran <- moran.test(y, states_sw)
## View the results
global_moran
##
## Moran I test under randomisation
##
## data: y
## weights: states_sw
##
## Moran I statistic standard deviate = 2.9895, p-value = 0.001397
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.268531084 -0.020833333 0.009368833
## Show a plot of the top most outlier states
moran.plot(as.vector(y),
listw = states_sw,
ylab = "Spatially-lagged Median Pneumonia Readmissions",
xlab = "Median Pneumonia Readmissions (No Lag)", pch = 16,
col = skittles[2], xlim = c(-3,3),
main = "Figure 3. Global Moran's I (State-Level Aggregation)")
Interpretation: Our global spatial autocorrelation test found a Moran’s \(I\) of 0.268 with a \(p\)-value of 0.0014, which is less than 0.05. Thus, we reject the \(H_0\) of a random distribution and conclude that we have significant positive spatial autocorrelation for median pneumonia-related readmissions! This suggests that states with similar median readmission rates tend to be clustered geographically in the U.S.!
The accompanying scatterplot produced with moran.plot()
confirms this. On the \(x\)-axis we
have median pneumonia-related readmission rates for each spatial unit
(state) and on the \(y\)-axis the
spatially lagged version of readmission rates. Spatial lags are
the weighted average value of neighboring states (spatial
units). Thus, this plot shows that high readmission states tend
to cluster with other high states and low states with other low states
(i.e., positive spatial autocorrelation). Outlier states in each of the
quadrants of the graph are also identified.
Identify each of the outliers from the graph and interpret the type of outlier (high-high, low-high, etc.) that they are.
Delaware: Low-high outlier; this state has low readmission rates but is surrounded by states that have high readmission rates.
Utah: Low-low outlier; this state has low readmission rates and is surrounded by similar neighboring states.
West Virginia, DC, New Jersey: High-high outliers; these states have high readmission rates and are surrounded by similar states.
NOTICE that I had you define the target as a
centered version of
Median_RawPredictedReadmissionRate, which is the median of
the untransformed, unscaled readmissions. Shouldn’t I have used
the transformed and scaled version,
bc_PredictedReadmissionRate? Can you think of any reason
why I chose to use the centered median of the raw readmissions rather
than the transformed version?
Centering simply shifts the data so that it is centered on zero. This helps when comparing to lagged y: with centered data, we can use the quadrants to see whether points fall above or below the overall mean. Unlike centering, scaling alters the spread of the data, which would make interpretation in the context of the original units more difficult. The Box-Cox transformation would also change the interpretation of the data, further hindering our ability to compare it to lagged y.
HINT 1: What does centering do that scaling does not? Could centering be important, for example, to make sure that \(y\) is going in the same direction as lagged \(y\)?
HINT 2: Why do we need \(y\) and lagged \(y\) going in the same general directions? Could it have anything to do with identifying which quadrant they fall into?
HINT 3: Chapter 8 by Paul Moraga on Spatial Autocorrelation may be helpful, but it also gets a bit deep in the weeds too.
Now go back up to temporarily reset the target to
bc_PredictedReadmissionRate here.
There is no need to scale it again! Then, re-run the
Global Moran’s \(I\). Make sure to
update the limits of the \(x\)-axis or
just comment that line out (xlim = c(...)) to look at the
graph. Looking at the value of \(I\),
the \(p\)-value, and the graph output -
does the conclusion change? Does it matter which value we choose for the
analysis?
The Moran I statistic is 0.289 with a p-value of 0.0007. Since the p-value is less than 0.05, we still reject the null hypothesis. The graph also appears to be more or less the same. Overall, it does not appear as if the choice makes a large difference.
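A sketch of how the swap might look (the variable name comes from the notes above; remember to restore the original target afterward):

```r
## Temporarily use the Box-Cox transformed target (already scaled upstream)
y <- states_sf_AggTrain$bc_PredictedReadmissionRate

## Re-run the global Moran's I with the same spatial weights
moran.test(as.vector(y), states_sw)
```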
NOTE: Make sure to change \(y\) back to the centered version of
Median_RawPredictedReadmissionRate before moving to the
next analysis!
Now that we’ve gotten a sense for possible outlier states and we know that we have global spatial autocorrelation in median readmission rates, it would be nice to concretely identify hotspots and coldspots. Unlike the global Moran’s \(I\), which provides an overall measure of spatial autocorrelation across the entire nation, LISA (Local Indicators of Spatial Association) helps pinpoint specific states that significantly contribute to that global pattern. LISA identifies clusters of similar values, such as high-high (hotspots) and low-low (coldspots). It can also be used to identify spatial high-low and low-high outliers, just as the global test did. This local approach is so powerful though because it enables us to map and interpret where spatial clustering or dispersion is happening, which is especially valuable for targeted policy intervention or resource allocation. So, even though it may not be as vital for our predictions, it can help us to better understand the spatial pattern of pneumonia-related hospital readmissions more generally.
To perform the LISA, we will use the localmoran()
function from spdep. More information on the function can
be found here.
Afterward, instead of printing the summary(), I show you the
head() of the first rows produced by the LISA.
## Run local Moran's I for LISA on vectorized y & applying same spatial weights
local_moran <- localmoran(as.vector(y), states_sw)
| Ii | E.Ii | Var.Ii | Z.Ii | Pr(z != E(Ii)) | |
|---|---|---|---|---|---|
| alabama | -0.04 | 0.00 | 0.03 | -0.26 | 0.79 |
| arizona | 0.11 | 0.00 | 0.03 | 0.68 | 0.50 |
| arkansas | 0.00 | 0.00 | 0.00 | 0.02 | 0.98 |
| california | -0.12 | -0.04 | 0.54 | -0.11 | 0.91 |
| colorado | 0.79 | -0.01 | 0.06 | 3.36 | 0.00 |
| connecticut | 0.06 | 0.00 | 0.00 | 1.12 | 0.26 |
Ii = Stands for \(I_i\), which is the local Moran’s \(I\) estimate for that state. \(I_i\) is a measure of local spatial
autocorrelation for the given state. If positive, it is similar to its
neighbors, if negative it is dissimilar.
E.Ii = Stands for \(\mathbb{E}(I_i)\), which is the expected
value of \(I_i\) under the null
hypothesis of spatial randomness. This value is usually close to
0.
Var.Ii = Stands for \(Var(I_i)\), which is the variance of \(I_i\) under the null hypothesis. It is used
to assess the significance of \(I_i\).
Z.Ii = Stands for \(Z_{I_i}\), which is the \(Z\)-score calculated as \(Z = \frac{I_i −
\mathbb{E}(I_i)}{\sqrt{Var(I_i)}}\), which indicates how extreme
\(I_i\) is compared to the random
expectation. Used to derive statistical significance.
Pr(z!=E(Ii)) = Stands for \(Probability(Z_{I_i} \ne \mathbb{E}(I_i))\),
which is the probability that \(I_i\)
is not equal to the expected value \(\mathbb{E}(I_i)\). In other words, it’s the
p-value associated with \(Z_{I_i}\)!
So, it tells you whether the local spatial autocorrelation is
statistically significant for any given state.
That might feel like a lot of confusing math, but let’s start to put it together into something meaningful we can use: HOTSPOTS vs. COLDSPOTS.
Hotspots (High-High). States with a high median readmission value surrounded by other high value states. As you get to the fringes of hotspots, other states that have high readmission rates but are surrounded by low states will not be labeled as part of the hotspot but could be labeled an outlier (high-low).
Coldspots (Low-Low). States with a low median readmission value surrounded by other low value states. As you get to the fringes of coldspots, other states that have low readmission rates but are surrounded by high states will not be labeled as part of the coldspot but could be labeled an outlier (low-high).
We can actually break this down a bit to help us better understand the difference between hot/coldspots and outliers - outliers are states whose readmission values are very different from those of their neighboring states.
| Readmissions | Neighbors | LISA Category | Quadrant | Explanation |
|---|---|---|---|---|
| High | High | Hotspot (H-H) | I | State with high readmissions is surrounded by other high states |
| Low | Low | Coldspot (L-L) | III | State with low readmissions is surrounded by other low states |
| High | Low | Outlier (H-L) | IV | State with high readmissions is surrounded by low rate states |
| Low | High | Outlier (L-H) | II | State with low readmissions is surrounded by high rate states |
| High or Low | Random | Not Sign. (N.S.) | - | No difference from random |
This implies, then, that if we identify the quadrants we identify the hot/coldspots! So, let’s do that.
Hotspots (Quadrant I, “High-High”) will be states that have:
A significant \(p\)-value (\(p < 0.05\)), which suggests that local \(I_i\) is significantly positive (\(I_i > 0\))
A centered median hospital pneumonia-related readmissions rate that is positive
A spatially-lagged median hospital pneumonia-related readmissions rate that is positive
What will the criteria for coldspot (quadrant III) states be?
Coldspots (Quadrant III, “Low-Low”) will be states that have:
A significant p-value (p < 0.05), which suggests that the local Ii is significantly positive (Ii > 0), since a low state surrounded by low states is similar to its neighbors
A centered median hospital pneumonia-related readmissions rate that is negative
A spatially-lagged median hospital pneumonia-related readmissions rate that is negative
We can use the lag.listw() function to extract the spatial lag of
the target. These values are the spatially-adjusted
median pneumonia-related readmission rates, i.e., the weighted average of each state’s neighbors.
lag_medianReadmissions <- lag.listw(states_sw, y)
## Run this if you want to see what they look like!
## lag_medianReadmissions
## Just Massachusetts' value:
lag_medianReadmissions[20]
## [1] 0.2126469
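As a sanity check, a state’s spatial lag is just the weighted average of its neighbors’ centered values; you can verify Massachusetts’ value by hand (a sketch assuming the states_nb and states_sw objects built above):

```r
## Massachusetts is polygon 20; its lag should equal the weighted
## average of its neighbors' centered values
nb_idx <- states_nb[[20]]
w      <- states_sw$weights[[20]]
sum(w * y[nb_idx])  # should match lag_medianReadmissions[20]
```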
The raw target is the % of all pneumonia patients discharged who are predicted to be readmitted within 30 days after being released from the hospital. So, e.g., the raw median pneumonia-related readmissions rate for Massachusetts is 16.6789, meaning that \(\approx\) 16.7% (no multiplication by 100 needed here!) are predicted to be readmitted based on a sliding-window average of 30-day readmissions.
Yet, the spatially-lagged pneumonia-readmissions for Massachusetts is only 0.2126469!
QUESTION: Why do the spatially-lagged variables seem SO shifted compared to the original value?
HINT: What did we do to the raw hospital readmissions before we calculated the spatial weights in this section?
The spatially-lagged variables seem shifted compared to the original values because we previously centered the raw data before calculating the spatial lags. Reversing the centering of the lagged data would bring the values closer to what is expected. Alternatively, we could center the raw data.
Show just Massachusetts’ median readmissions to see its value after performing the step I am referring to in the hint.
HINT: Massachusetts is at row 20 in the
states_sf_AggTrain dataframe, which will make it easier to
locate its value in the vector created by looking at just the
Median_RawPredictedReadmissionRate column.
# Get the mean used for centering
center_value <- attr(y, "scaled:center")
# Add the mean back to the centered data
y_original <- y + center_value
lag_medianReadmissions2 <- lag.listw(states_sw, y_original)
## Run this if you want to see what they look like!
## lag_medianReadmissions2
## Just Massachusetts' value:
lag_medianReadmissions2[20]
## [1] 16.2344
# Centering the Raw data
centered_values <- scale(states_sf_AggTrain$Median_RawPredictedReadmissionRate, center = TRUE, scale = FALSE)
# Show Massachusetts' centered value
centered_values[20]
## [1] 0.6571469
QUESTION: Is this a lot closer to the spatially-lagged value?
Yes, the values are more comparable.
local_df <- local_moran %>% as.data.frame()
## Add the LISA results back into the states_sf_AggTrain dataset
states_sf_AggTrain <- states_sf_AggTrain %>%
mutate(
## adds the local I
localI = local_df$Ii,
## adds the p-value
pValue = local_df$`Pr(z != E(Ii))`,
## adds the lagged median readmissions
lag_medianReadmissions = lag_medianReadmissions
)
The criteria we used (in case you already forgot!) were set here. All of the steps might make it easy to forget what we are trying to do, but we’re finally at the important part: identifying the quadrants.
The code below identifies the quadrants and adds them to the
states_sf_AggTrain dataset. Your task is to add comments to
the code below where indicated. Make sure to refer back to the criteria
we set here if needed!
Answer in the comments.
# Update states_sf_AggTrain with a new quadrant column
states_sf_AggTrain <- states_sf_AggTrain %>%
mutate(
## define values in quadrant by the following cases
quadrant = case_when(
# If the p-value is greater than 0.05, label the point as N.S.
pValue > 0.05 ~ "N.S.",
    ## COMMENT HERE - Why do I not have to specify p-value again?
    # case_when() evaluates conditions in order, so any row reaching
    # these cases already failed pValue > 0.05 (i.e., it is significant).
# A point is a hotspot when it falls into quadrant 1.
# This is defined as y and lagged y both above 0.
y > 0 & lag_medianReadmissions > 0 ~ "Hotspot",
# A point is a coldspot when it falls into quadrant 3.
# This is defined as y and lagged y both below 0.
y < 0 & lag_medianReadmissions < 0 ~ "Coldspot",
    # A point is a high-low outlier when it falls into quadrant IV.
    # This is defined as y above 0 and lagged y below 0.
    y > 0 & lag_medianReadmissions < 0 ~ "High-Low Outlier",
    # A point is a low-high outlier when it falls into quadrant II.
    # This is defined as y below 0 and lagged y above 0.
    y < 0 & lag_medianReadmissions > 0 ~ "Low-High Outlier",
    # Assign a character NA if the data meet none of the described cases
    TRUE ~ NA_character_
)
)
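Before mapping, a quick tally can confirm the labels look sensible (a convenience check, not part of the required pipeline):

```r
## Count how many states fell into each LISA category
table(states_sf_AggTrain$quadrant, useNA = "ifany")
```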
Lastly, let’s visualize our hot/cold spots on the map! Note that we
are now using geom_sf() for mapping rather than
geom_polygon(). This is because our dataframe now contains
shapefile geometries instead of just the polygons that
it did earlier, hence the change.
ggplot(states_sf_AggTrain) +
## Now using geom_sf() because we have shapefile data rather than just
## polygons; geom_polygon() will no longer work here!
geom_sf(aes(fill = quadrant), size = 0.2,
alpha = 1, color = "darkgray") +
## Fill based on the quadrants!
scale_fill_manual(name = "LISA Cluster (Quadrant)",
values = c("Hotspot" = skittles[6],
"Coldspot" = skittles[1],
"High-Low Outlier" = skittles[7],
"Low-High Outlier" = skittles[2],
"N.S." = "#dadbd7")) +
## Makes a completely blank map background
theme(
panel.grid = element_blank(),
panel.background = element_rect(fill = "white"),
plot.background = element_rect(fill = "white"),
axis.text = element_blank(),
axis.ticks = element_blank(),
axis.title = element_blank(),
legend.position = "bottom"
) +
labs(title = "LISA Clustering of Median Pneumonia-related Hospital Readmission Rates",
subtitle = "Hotspots and coldspots show spatial clusters of similar values",
caption = "Source: Centers for Medicare & Medicaid Services (CMS), 2024")
Interpret what it means if Indiana is a “low-high” outlier.
If Indiana is a “low-high” outlier, this means that the state has unusually low readmission rates for pneumonia when compared to its neighbors.
Do the hot and coldspots make sense with your observations of graph
p1? Why or why not? Can you explain why hot/coldspots will
not necessarily correspond with individual states’ high
and low median readmission rates?
The coldspots appear to be roughly consistent with what is demonstrated in p1. Furthermore, many of the hotspots are shown in p1 to have decently high readmission rates. However, many of the very high rate states in p1 are labeled as N.S. on this new map. The difference between p1 and the hot/coldspot map is that while p1 is just looking at individual state performance, the hot/coldspot map is comparing states to their neighbors. Thus, it is highlighting how states compare to the states near it.
Spatial regression models explicitly account for spatial relationships in data, which is something we encounter constantly in public health and epidemiology, as well as environmental studies (e.g., climate change, pollution, wildfire risk), agriculture (e.g., soil nutrients, water availability), sociology and social work (e.g., poverty, social justice), criminology (e.g., real-time crime predictions), and business (e.g., real-estate and housing prices). In other words, spatial regression gives us tools to analyze how nearby locations influence each other, especially when spatial relationships violate the assumption of independence in OLS regression!
There are two common issues that spatial regression addresses:
Spatial dependence, in which outcomes in nearby locations tend to be similar (e.g., rates of a transmissible disease tend to be higher in nearby areas vs. farther areas).
Spatial heterogeneity, in which relationships between variables may vary across space (e.g., housing prices, climate, or wildfire risk depend on geographic features of the landscape, sometimes features we have not even measured).
There are three types of spatial regression models worth mentioning. The spatial lag model (aka, Spatial Autoregressive Model or SAR) is the one we will focus on first. In this regression model, the outcome depends on neighboring outcomes. This makes sense for us because we know from the LISA analysis that we have hot/coldspots of pneumonia-related readmissions!
The general structure of the SAR model is:
\(y = \rho Wy + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_n X_n + \epsilon\)
where \(y\) is the target variable
(median pneumonia-related readmissions), \(W\) is the spatial weights matrix, which
defines the neighborhoods (stored in states_sw!), and \(\rho\) is the *spatial lag coefficient*.
This means that the first term of the equation, \(\rho Wy\), measures the spatial lag of the
target variable as the average of neighbors’ outcomes multiplied by the
spatial lag coefficient. It tells us how much the outcome
deviates just due to spatial relationships with neighbors! The
remaining terms, \(\beta_1 X_1 + \beta_2 X_2 +
... \beta_n X_n\) are just the standard regression coefficients
for each of the \(n\) predictors. \(\epsilon\) is an unmeasured error term,
which we also have in OLS regression.
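To make the lag term concrete, here is a toy base-R sketch (the three-region layout and values are invented for illustration): with a row-standardized \(W\), \(Wy\) is simply each region’s average of its neighbors’ outcomes.

```r
## Toy illustration (numbers invented): with a row-standardized W, the
## spatial lag Wy is each region's average of its neighbors' outcomes.
## Three regions in a line: 1-2 and 2-3 are neighbors.
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
W <- W / rowSums(W)   ## row-standardize (like style = "W" weights)
y <- c(10, 20, 30)
as.vector(W %*% y)    ## 20 20 20: e.g., region 2's lag = mean(10, 30)
```

Multiplying that neighbor average by \(\rho\) gives the first term of the SAR model.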
We are justified in proceeding with a SAR because, as you hopefully remember from the end of Project 1, pneumonia-related hospital readmissions could be predicted with an OLS regression! However, we also know we have spatial patterns here that we want to try to account for in our predictive models, so let’s do that now.
Your data have already been partitioned into training and testing sets, preprocessed and normalized, and we just calculated the spatial weights matrix, \(W\). What if we did NOT already have the data fully pre-processed? In a few sentences, describe the steps we would have to take if we were starting from the raw data.
HINT: Think of the data science lifecycle/flow we keep coming back to!
The following is an example of how we might have pre-processed the data.
First, knowing our target, we would drop the potential targets that we did not want to keep, as well as variables with no/little variance.
Second, we would split the data into our training and testing sets. We do this now to avoid data leakage later on.
Third, we would impute missing values, using a method like missForest or MICE. In addition, we would apply our frequency encoding.
Fourth, we would apply our transformation (box-cox), centering, and scaling.
Fifth, we would repeat the steps at the beginning of this demo, making sure that we properly aggregate our data at the state level, select the median data, and calculate our spatial weights.
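The flow above could be sketched in code. This is a minimal base-R stand-in with an invented data frame `raw`; the real pipeline used dedicated tools (e.g., missForest/MICE for imputation, Box-Cox transformation), so treat the median imputation and scale() calls below as placeholders for those steps.

```r
## Minimal sketch of the pre-processing flow, with a hypothetical `raw` df
set.seed(123)
raw <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100))

## Split into training/testing FIRST to avoid data leakage later on
trainIdx <- sample(nrow(raw), size = 0.8 * nrow(raw))
train <- raw[trainIdx, ]
test  <- raw[-trainIdx, ]

## Impute missing predictors (placeholder for missForest/MICE): always
## compute the statistic on the TRAINING data and reuse it on the test set
for (v in c("x1", "x2")) {
  m <- median(train[[v]], na.rm = TRUE)
  train[[v]][is.na(train[[v]])] <- m
  test[[v]][is.na(test[[v]])]   <- m
}

## Center and scale using the TRAINING means/SDs (placeholder for the
## Box-Cox + center + scale step)
mu  <- colMeans(train[c("x1", "x2")])
sds <- sapply(train[c("x1", "x2")], sd)
train[c("x1", "x2")] <- scale(train[c("x1", "x2")], center = mu, scale = sds)
test[c("x1", "x2")]  <- scale(test[c("x1", "x2")],  center = mu, scale = sds)
```

The key design choice is that every learned statistic (medians, means, SDs, Box-Cox lambdas) comes from the training set only.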
## Start from the training data
sarTrain <- states_sf_AggTrain %>%
## Make it a dataframe/tibble
as_data_frame() %>%
## Drop the features that we dropped before the analysis in Project 1, as
## well as the features we do NOT want to use as predictors!
select(-Median_RawPredictedReadmissionRate,
-Median_RawMedicareSpending,
-Median_RawHospitalReturnDays,
-Median_RawDeathRate,
-ID,
-localI,
-pValue,
-quadrant,
-geom,
-lag_medianReadmissions)
If we didn’t already know that we have spatial relationships from the Moran’s \(I\) and the LISA, we could also consider calculating a Moran’s \(I\) on the residuals of the OLS regression. This would tell us, even if we had NOT yet done a Moran’s \(I\) or LISA, whether we should pay more attention to our spatial relationships and explicitly include them in our regression model.
\(H_0\): The null hypothesis is that the residuals are not spatially autocorrelated. Thus, the OLS is not misspecified and we do not need to worry about spatial relationships in our target.
## Moran's I test on residuals
moran.test(residuals(OLS), states_sw)
##
## Moran I test under randomisation
##
## data: residuals(OLS)
## weights: states_sw
##
## Moran I statistic standard deviate = 1.5437, p-value = 0.06133
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.128852929 -0.020833333 0.009402604
The Moran I statistic is 0.129, and the p-value is 0.06, which is greater than 0.05. Thus, we fail to reject the null hypothesis. Our residuals do not appear to be spatially autocorrelated.
Yes, because the model accounts for the spatial autocorrelation, masking the underlying spatial relation of the data.
The two analyses are focused on different goals. The first tests we did were looking to identify spatial clustering within the raw data. The test performed after modeling seeks to identify any spatial structure leftover after creating our model.
The lagsarlm() from the spatialreg package
will fit the SAR for us, allowing us to specify the spatial weights
matrix, states_sw, that we calculated all the way back here.
## SAR (spatial lag model)
SAR <- lagsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain,
listw = states_sw)
summary(SAR)
##
## Call:lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain,
## listw = states_sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.657380 -0.240268 -0.011297 0.205659 0.748405
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate
## (Intercept) 0.0037668
## MRSA.Bacteremia -0.1085416
## Abdomen.CT.Use.of.Contrast.Material -0.0323645
## Death.rate.for.pneumonia.patients -0.0659404
## Postoperative.respiratory.failure.rate -0.5704259
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.9855298
## Composite.patient.safety -0.3809309
## Medicare.spending.per.patient 0.4864549
## Healthcare.workers.given.influenza.vaccination 0.0336501
## Left.before.being.seen 0.0319323
## Venous.Thromboembolism.Prophylaxis -0.1223090
## Doctor.communication 0.3979458
## Communication.about.medicines -0.2554363
## Discharge.information -0.2890223
## Cleanliness -0.2588597
## Quietness -0.3024060
## Hospital.return.days.for.pneumonia.patients 0.0927036
## Std. Error
## (Intercept) 0.0803023
## MRSA.Bacteremia 0.2512056
## Abdomen.CT.Use.of.Contrast.Material 0.1629524
## Death.rate.for.pneumonia.patients 0.2029840
## Postoperative.respiratory.failure.rate 0.3537272
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.2706225
## Composite.patient.safety 0.2174500
## Medicare.spending.per.patient 0.1476588
## Healthcare.workers.given.influenza.vaccination 0.1160326
## Left.before.being.seen 0.1466825
## Venous.Thromboembolism.Prophylaxis 0.2325981
## Doctor.communication 0.2050351
## Communication.about.medicines 0.2433874
## Discharge.information 0.1856011
## Cleanliness 0.2002415
## Quietness 0.1290785
## Hospital.return.days.for.pneumonia.patients 0.1329591
## z value Pr(>|z|)
## (Intercept) 0.0469 0.9625866
## MRSA.Bacteremia -0.4321 0.6656812
## Abdomen.CT.Use.of.Contrast.Material -0.1986 0.8425655
## Death.rate.for.pneumonia.patients -0.3249 0.7452908
## Postoperative.respiratory.failure.rate -1.6126 0.1068281
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 3.6417 0.0002708
## Composite.patient.safety -1.7518 0.0798066
## Medicare.spending.per.patient 3.2945 0.0009861
## Healthcare.workers.given.influenza.vaccination 0.2900 0.7718120
## Left.before.being.seen 0.2177 0.8276656
## Venous.Thromboembolism.Prophylaxis -0.5258 0.5990004
## Doctor.communication 1.9409 0.0522745
## Communication.about.medicines -1.0495 0.2939458
## Discharge.information -1.5572 0.1194175
## Cleanliness -1.2927 0.1961020
## Quietness -2.3428 0.0191393
## Hospital.return.days.for.pneumonia.patients 0.6972 0.4856566
##
## Rho: 0.33291, LR test value: 4.7905, p-value: 0.028617
## Asymptotic standard error: 0.13262
## z-value: 2.5102, p-value: 0.012065
## Wald statistic: 6.3013, p-value: 0.012065
##
## Log likelihood: -12.75033 for lag model
## ML residual variance (sigma squared): 0.095777, (sigma: 0.30948)
## Number of observations: 49
## Number of parameters estimated: 19
## AIC: NA (not available for weighted model), (AIC for lm: 66.291)
## LM test for residual autocorrelation
## test value: 0.05318, p-value: 0.81762
The summary() output of the SAR model shows a lot going
on. Let’s go through the three key pieces of evidence we want to assess
from these results.
Here, we find that \(\rho = 0.33291\) with an estimated p-value of \(p = 0.012065\) based on a Z-statistic of \(Z = 2.5102\). What the Wald Z-test is doing is testing whether \(\rho = 0\), just like in Pearson correlation tests!
CONCLUSION: With a p-value LESS than 0.05, we would conclude that \(\rho \ne 0\) - thus there is sufficient spatial dependency in the target variable to justify including the lag term in this model. That’s the opposite of what the Moran’s \(I\) on the residuals implied but feels very consistent with what we found for both global and local Moran’s \(I\) earlier.
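As a quick sanity check, the Wald z and its two-sided p-value can be reproduced directly from the reported \(\rho\) and its asymptotic standard error (values copied from the summary() output above):

```r
## Reproduce the Wald test of rho = 0 from the values reported above
rho <- 0.33291   ## spatial lag coefficient from summary(SAR)
se  <- 0.13262   ## its asymptotic standard error
z   <- rho / se           ## ~2.5102
p   <- 2 * pnorm(-abs(z)) ## two-sided p-value, ~0.012065
```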
More generally, Likelihood Ratio Tests (LRTs) are used to compare how well a nested regression model fits relative to the larger model it is nested within. If you’ve ever done stepwise regression, you’ve inherently done a Likelihood Ratio Test, possibly without knowing it! Note that the OLS is nested within the SAR: the SAR reduces to OLS when the spatial lag coefficient \(\rho = 0\), because when there is NO spatial lag (\(\rho = 0\)), the first term of the SAR model cancels out!
The lagsarlm() unfortunately will not automatically
compare the Spatial Lag Model (SAR) with the Ordinary Least Squares
(OLS) model - but we can. We will do this by separately
performing a Likelihood Ratio Test (LRT). The LRT itself is very
simple. The formula for the likelihood ratio (LR) of the two models can
be found by:
\(LR = 2 \times (\log L_{SAR} - \log L_{OLS})\)
where \(\log L_{SAR}\) is the log-likelihood of the SAR model and \(\log L_{OLS}\) of the OLS model. The LR test statistic follows a \(\chi^2\) distribution with 1 degree of freedom.
But if that feels too mathy for you, just remember this: a LR of 0 means there is NO difference between the two models being compared, and the bigger the LR, the stronger the evidence that the two models differ. Thus, our \(H_0\) (null hypothesis) is that there is no difference between the SAR model and the OLS. So, if our \(p\)-value is less than 0.05, we will reject that null hypothesis.
In the output, the \(\log L_{SAR} =
-12.75\) but the \(\log
L_{OLS}\) isn’t shown. We can very quickly perform the LRT in
R using the anova() function.
NOTE that this function is confusingly named here; it
will either do an Analysis of Variance (ANOVA) or an Analysis of
Deviance, also known as the Likelihood Ratio Test, which is what we are
doing here!
## Give it the larger model first, then the null model.
## SAR is our larger model; the null OLS is nested within the SAR because
## the SAR = OLS when rho = 0!
anova(SAR, OLS)
## Model df AIC logLik Test L.Ratio p-value
## SAR 1 19 63.501 -12.750 1
## OLS 2 18 66.291 -15.146 2 4.7905 0.028616
This tells us that the log-likelihood of SAR (\(\log L_{SAR}\)) is -12.75, the log-likelihood of OLS (\(\log L_{OLS}\)) is -15.146, and the likelihood ratio (LR) statistic is 4.7905 with an estimated \(p\)-value of 0.028616. Although it does not say this, the degrees of freedom is 1 here: 19 df from SAR - 18 df from OLS.
Thus, if we wanted to do it by “hand” (i.e., calculate it according to the formula), it would look like this:
## Likelihood ratio test by "hand"
## Extract the log-likelihoods for each model
logLik_OLS <- logLik(OLS)
logLik_SAR <- logLik(SAR)
## Calculate the likelihood ratio (LR) - make sure to use as.numeric()!
LR <- 2 * (as.numeric(logLik_SAR) - as.numeric(logLik_OLS))
## Find the estimated p-value for the LR statistic using the chi-squared
## distribution. Use the df = 19-18 that we observed above.
p <- pchisq(LR, df = 1, lower.tail = FALSE)
## Print the results!
cat("The likelihood ratio LR =", LR, "with 1 degree of freedom has p =", p)
## The likelihood ratio LR = 4.790534 with 1 degree of freedom has p = 0.02861655
Which is the same as what the anova() function gave
us!
CONCLUSION: With a \(p\)-value LESS than 0.05, we would conclude that there is a difference between the two models. The more complex model, SAR, adds sufficient information to improve the fit. Thus, there is sufficient evidence that a spatial model improves our predictive ability over an OLS alone. Again, that’s the opposite of what the Moran’s \(I\) on the residuals implied but feels very consistent with what we found for both global and local Moran’s \(I\) earlier.
The LM test is based on the spatially lagged residuals and compares the residual spatial pattern to what would be expected under independence - meaning that it’s similar to the test we did earlier when we tested whether there was a spatial pattern to the OLS residuals! These are very similar tests, albeit calculated in slightly different ways.
The null hypothesis (\(H_0\)) again is that there is NO spatial pattern to the residuals; the difference now is that we are testing if there’s still “leftover” spatial autocorrelation in the residuals AFTER fitting our SAR model. In other words - did the SAR model do a sufficiently good job at dealing with spatial autocorrelation? If yes, yay! We’re done! If no, we’d need to do a Spatial Error Model (SEM) or Spatial Durbin Model (SDM), the other two types of regression I alluded to wayyyy back at the beginning of this section.
Our LM test statistic, per the summary() output, is
0.05318 with a \(p\)-value of
0.81762.
CONCLUSION: With a \(p\)-value MORE than 0.05, we could conclude that there is NO residual spatial autocorrelation after fitting the SAR. Thus, the SAR model is sufficient for explaining the spatial autocorrelation in the target!
Briefly summarize everything we know at this point: (1) Global Moran’s \(I\) results, (2) Local Moran’s \(I\) (LISA) results, (3) Moran’s \(I\) on the residuals of the OLS model, (4) the test for spatial autoregression using the coefficient \(\rho\), (5) the Likelihood Ratio Test (LRT), and (6) the LM test for residual spatial autocorrelation after fitting the SAR model. DO NOT simply copy my interpretations but instead try to make sure you understand each of these pieces of evidence and what they tell us.
- From the Global Moran’s I results, we see that the data produces a Moran’s I of 0.268 and a p-value of 0.0014. The purpose of the Global Moran’s I is to test if our data shows spatial autocorrelation. Since our p-value is below 0.05, we reject the null hypothesis. We can conclude that our data has significant positive spatial autocorrelation.
- Our purpose of looking at LISA results is to see if we can identify where the spatial clusters are forming, and what they mean. From the results, we identified two groups. One was a coldspot cluster that consisted of Idaho, Wyoming, Nebraska, Utah, and Colorado. The other was a hotspot cluster that consisted of Ohio, Pennsylvania, Virginia, and Maryland. There was also one identified low-high outlier, being Indiana. This demonstrated an interesting cold/hotspot grouping towards the west and east sides of the country respectively.
- The Moran’s I for the residuals of the OLS model obtained a Moran’s I of 0.129 and a p-value of 0.06133. Since the p-value is greater than 0.05, we fail to reject the null hypothesis and cannot conclude that there is spatial autocorrelation in the model residuals.
- The test for spatial autoregression using the coefficient rho also looks at spatial dependency within the model. However, counter to our previous conclusion, our test obtained a p-value of 0.012065, suggesting that there is in fact spatial dependency in the model.
- The Likelihood Ratio Test is used to determine if a spatial model would be more effective than just an OLS model. Our LRT obtained a p-value below 0.05, suggesting that a spatial model does in fact improve our prediction ability when compared to an OLS model.
- The LM test for Residual Spatial Autocorrelation assesses if there is a spatial pattern within our residuals after fitting the SAR model. The test obtained a p-value greater than 0.05, telling us that there is no spatial pattern in the model residuals. This suggests that the SAR model properly accounts for the spatial autocorrelation in the target.
What is your final takeaway? Do we need to account for spatial autocorrelation in our regression model?
Yes, we do need to account for spatial autocorrelation in our regression model. The first two tests demonstrate the presence of spatial autocorrelation within the data, even pinpointing the location of spatial clusters. Then, we were able to demonstrate that the spatial model outperformed the OLS model and sufficiently accounted for the autocorrelation.
| Coefficient | Beta1 | SE | p-value |
|---|---|---|---|
| rho | 0.33* | 0.133 | 0.012 |
| (Intercept) | 0.00 | 0.080 | >0.9 |
| MRSA.Bacteremia | -0.11 | 0.251 | 0.7 |
| Abdomen.CT.Use.of.Contrast.Material | -0.03 | 0.163 | 0.8 |
| Death.rate.for.pneumonia.patients | -0.07 | 0.203 | 0.7 |
| Postoperative.respiratory.failure.rate | -0.57 | 0.354 | 0.11 |
| Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate | 0.99*** | 0.271 | <0.001 |
| Composite.patient.safety | -0.38 | 0.217 | 0.080 |
| Medicare.spending.per.patient | 0.49*** | 0.148 | <0.001 |
| Healthcare.workers.given.influenza.vaccination | 0.03 | 0.116 | 0.8 |
| Left.before.being.seen | 0.03 | 0.147 | 0.8 |
| Venous.Thromboembolism.Prophylaxis | -0.12 | 0.233 | 0.6 |
| Doctor.communication | 0.40 | 0.205 | 0.052 |
| Communication.about.medicines | -0.26 | 0.243 | 0.3 |
| Discharge.information | -0.29 | 0.186 | 0.12 |
| Cleanliness | -0.26 | 0.200 | 0.2 |
| Quietness | -0.30* | 0.129 | 0.019 |
| Hospital.return.days.for.pneumonia.patients | 0.09 | 0.133 | 0.5 |
| Abbreviations: SE = Standard Error |
| 1 *p<0.05; **p<0.01; ***p<0.001 | |||
We can think of dealing with spatial spillover as helping us distinguish the wheat from the chaff (truth from noise). Interpret which of the coefficients are significant (\(p\)-value < 0.05) AND compare that to both Table 10 (parsimonious OLS model results) AND Figure 20 (Feature Importance from Elastic Net) from Project 1. Now that we are accounting for spatial autocorrelation, which predictors are still important? Did any become significant (important) now that weren’t before?
The following coefficients are significant, as they have a p-value < 0.05:
- rho
- Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate
- Medicare.spending.per.patient
- Quietness
Compared to what was found in Project 1, only Medicare spending and Quietness remain significant. In addition, Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate has gained significance.
Recall that our goal for our stakeholder is to be able to predict(!) which hospitals perform better than others. They want to make a deployable tool for patients/customers. So, let’s test the predictive ability of our model by using our testing data to calculate the \(RMSE\).
This is called “spatially embedding” our testing data. Notice that we are using the spatial weights from the training set to prevent data leakage!!
## Join the shapefile with spatial weights and neighbor lists back to the
## testing data
states_sf_AggTest <- left_join(states_sf, stateAggTest,
by = c("ID" = "region"))
## Start from the testing data
sarTest <- states_sf_AggTest %>%
## Make it a dataframe/tibble
as_data_frame() %>%
## Move the ID column to the rownames
column_to_rownames("ID") %>%
## Drop the features that we dropped before the analysis in Project 1, as
## well as the features we do NOT want to use as predictors!
select(-Median_RawPredictedReadmissionRate,
-Median_RawMedicareSpending,
-Median_RawHospitalReturnDays,
-Median_RawDeathRate,
-geom) %>%
## Have to drop any missing values from the testing data
drop_na()
The spatialreg package has its own version of the
predict function, which will take the SAR model, the
testing data, and the spatial weights (states_sw).
## Returns a list; make sure to turn it into a dataframe for convenience
predictions <- predict(SAR, newdata = sarTest, listw = states_sw) %>% as_tibble()
| State | fit | trend | signal |
|---|---|---|---|
| alabama | -0.29 | -0.36 | 0.07 |
| arizona | 0.31 | 0.49 | -0.19 |
| arkansas | -0.32 | -0.32 | 0.01 |
| california | 0.30 | 0.49 | -0.19 |
| colorado | -0.75 | -0.50 | -0.25 |
| connecticut | 0.85 | 0.51 | 0.34 |
We can see that the predictions contain the following components:
- fit: the predicted outcome \(\hat{y}\); here, this includes the influence of the spatial lag!
- trend: the combination of linear predictor(s) for each value of the target \(y\); here, this is a combination of all \(\beta_1X_1 + ... + \beta_nX_n\) in our model - so all of the predictors and their coefficients
- signal: the spatial term, \(\rho Wy\); this is the influence that the spatial autocorrelation has on each value of the target \(y\)

We will use the fit component from the predictions to
calculate RMSE.
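As a quick consistency check on these components: fit should decompose as trend + signal. Verifying against the printed values from the table above (rounded to two decimals, so we allow a little rounding slack):

```r
## Sanity check: fit should decompose as trend + signal. Using the six
## rows printed above (rounded to 2 decimals, so allow rounding slack)
fit    <- c(-0.29, 0.31, -0.32, 0.30, -0.75, 0.85)
trend  <- c(-0.36, 0.49, -0.32, 0.49, -0.50, 0.51)
signal <- c( 0.07, -0.19, 0.01, -0.19, -0.25, 0.34)
max(abs(fit - (trend + signal)))  ## within rounding error of the table
```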
## Calculate RMSE using the `fit` values
sarRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - predictions$fit)^2))
To be able to determine how much better a job it’s doing (if any), we need to compare the RMSE of the new OLS (not the one from Project 1) to the RMSE from the SAR model we just calculated.
olsPredictions <- predict(OLS, newdata = sarTest)
olsRMSE <- sqrt(mean((sarTest$bc_PredictedReadmissionRate - olsPredictions)^2))
| Model | RMSE |
|---|---|
| OLS | 1.33 |
| SAR (Spatial Lag) | 1.15 |
Why is it challenging to interpret this RMSE in terms of the original/raw median pneumonia-related hospital readmissions? (I know.. I know… you’ve gotten it by now. Just checking.)
This RMSE is telling us the model’s performance at predicting the transformed target, which differs from the raw median readmissions.
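If we ever wanted errors on the raw scale, one option (a sketch; the `lambda` below is hypothetical, and in practice we would reuse the \(\lambda\) estimated during preprocessing) would be to back-transform predictions with the inverse Box-Cox before computing RMSE or MAE:

```r
## Sketch: inverse Box-Cox to move predictions back to the raw scale.
## `lambda` is hypothetical here; reuse the lambda from preprocessing.
inv_boxcox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}
boxcox <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
lambda <- 0.5
x <- c(2, 10, 25)
all.equal(inv_boxcox(boxcox(x, lambda), lambda), x)  ## TRUE (round trip)
```

We would apply inv_boxcox() to both the predictions and the observed values before computing the error metrics.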
So, let’s add the Mean Absolute Error (MAE) as well, which may be easier to interpret.
## OLS MAE
olsMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - olsPredictions))
## SAR MAE
sarMAE <- mean(abs(sarTest$bc_PredictedReadmissionRate - predictions$fit))
| Model | RMSE | MAE |
|---|---|---|
| OLS | 1.33 | 0.84 |
| SAR (Spatial Lag) | 1.15 | 0.79 |
Interpret the MAE and the RMSE. Which model is better? Don’t forget to interpret the RMSE in terms of the transformed target!
The RMSE for the OLS model is 1.33, while the RMSE for the SAR is 1.15. This tells us that on average, the SAR model is closer to the actual values of the transformed target than the OLS model is.
The MAE of the OLS model is 0.84, while the MAE for the SAR is 0.79. This tells us that the absolute error is lower by 0.05 for the SAR model than the OLS model.
Overall, both measures suggest that the SAR model is better at predicting the transformed target than the OLS model.
The results of our SAR showed that we did not need to worry about any other types of spatial regression because the LM Test for spatial autocorrelation of the residuals was not significant. Still, I want to take a brief moment to explain what those models are and how we do them, using our current data as an example. They are important enough alternatives to the SAR that we may need them in the future!
In both of these cases, you’d consider these if your LM Test for residual spatial autocorrelation was significant.
We’ve discussed how the SAR only models spatial “spillover” into the target, but we’ve also peeked at whether we have spatial spillover in the residuals (errors). Why would we care? One of the fundamental assumptions of an OLS regression is independence of our error terms, which is something we typically do not test for because we assume we have it based on our research/experiment/analytical design. But the moment we have spatial dependence, well, we may potentially violate this key assumption!
In our case, after controlling for all known predictors and modeling spatial dependence in the target only, we were able to uphold the assumption of independence (that’s what the LM test showed). But sometimes, even after SAR, our data may still show similar residual patterns. At such a point, the next step would be to fit a Spatial Error Model (SEM).
The general structure of the SEM is:
\(y = \lambda W \epsilon + \beta_1 X_1 + ... + \beta_n X_n + E\)
where \(y\) is the target variable
(median pneumonia-related readmissions), \(W\) is the spatial weights matrix, which
defines the neighborhoods (stored in states_sw!), and \(\lambda\) is the *spatial error
coefficient*. This means that the first term of the equation, \(\lambda W \epsilon\), measures the spatial
correlation between our errors. So, if \(\lambda = 0\), the spatial correlation
between the errors isn’t enough to bias the standard errors. As with
SAR, the remaining terms, \(\beta_1 X_1 +
\beta_2 X_2 + ... \beta_n X_n\) are just the standard regression
coefficients for each of the \(n\)
predictors. \(E\) is an unmeasured
error term, which is now the overall error leftover after fitting the
SEM regression.
Fitting the model is pretty easy using the errorsarlm()
function:
## Fit the SEM (spatial error model) using the training data and weights matrix
SEM <- errorsarlm(bc_PredictedReadmissionRate ~ ., data = sarTrain,
listw = states_sw)
## View the results!
summary(SEM)
##
## Call:errorsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain,
## listw = states_sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.627137 -0.230742 -0.030906 0.212419 0.731746
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate
## (Intercept) 0.0040570
## MRSA.Bacteremia -0.1927576
## Abdomen.CT.Use.of.Contrast.Material -0.0875032
## Death.rate.for.pneumonia.patients -0.2111690
## Postoperative.respiratory.failure.rate -0.2147449
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.7302127
## Composite.patient.safety -0.3182179
## Medicare.spending.per.patient 0.4806595
## Healthcare.workers.given.influenza.vaccination -0.0044689
## Left.before.being.seen 0.1297515
## Venous.Thromboembolism.Prophylaxis -0.2324866
## Doctor.communication 0.4387042
## Communication.about.medicines -0.0940048
## Discharge.information -0.5370302
## Cleanliness -0.1512913
## Quietness -0.2218475
## Hospital.return.days.for.pneumonia.patients 0.0842956
## Std. Error
## (Intercept) 0.1240457
## MRSA.Bacteremia 0.2275375
## Abdomen.CT.Use.of.Contrast.Material 0.1472948
## Death.rate.for.pneumonia.patients 0.2039855
## Postoperative.respiratory.failure.rate 0.3005054
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.2566630
## Composite.patient.safety 0.2052205
## Medicare.spending.per.patient 0.1430047
## Healthcare.workers.given.influenza.vaccination 0.1202604
## Left.before.being.seen 0.1361761
## Venous.Thromboembolism.Prophylaxis 0.2140493
## Doctor.communication 0.1937198
## Communication.about.medicines 0.2296558
## Discharge.information 0.1868993
## Cleanliness 0.1993748
## Quietness 0.1619164
## Hospital.return.days.for.pneumonia.patients 0.1412999
## z value Pr(>|z|)
## (Intercept) 0.0327 0.9739093
## MRSA.Bacteremia -0.8471 0.3969133
## Abdomen.CT.Use.of.Contrast.Material -0.5941 0.5524662
## Death.rate.for.pneumonia.patients -1.0352 0.3005683
## Postoperative.respiratory.failure.rate -0.7146 0.4748486
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 2.8450 0.0044408
## Composite.patient.safety -1.5506 0.1209940
## Medicare.spending.per.patient 3.3611 0.0007762
## Healthcare.workers.given.influenza.vaccination -0.0372 0.9703570
## Left.before.being.seen 0.9528 0.3406804
## Venous.Thromboembolism.Prophylaxis -1.0861 0.2774190
## Doctor.communication 2.2646 0.0235353
## Communication.about.medicines -0.4093 0.6822981
## Discharge.information -2.8734 0.0040612
## Cleanliness -0.7588 0.4479551
## Quietness -1.3701 0.1706445
## Hospital.return.days.for.pneumonia.patients 0.5966 0.5507929
##
## Lambda: 0.61363, LR test value: 5.9309, p-value: 0.014877
## Asymptotic standard error: 0.12394
## z-value: 4.9512, p-value: 0.00000073764
## Wald statistic: 24.514, p-value: 0.00000073764
##
## Log likelihood: -12.18013 for error model
## ML residual variance (sigma squared): 0.086126, (sigma: 0.29347)
## Number of observations: 49
## Number of parameters estimated: 19
## AIC: 62.36, (AIC for lm: 66.291)
Test the null hypothesis that \(\lambda = 0\), i.e., that there is NO spatial autocorrelation in the residuals. What p-value do you get and what conclusions do you make?
The LR test produces a p-value of 0.014877, which is below 0.05. Therefore, we reject the null hypothesis. There is spatial autocorrelation in the residuals.
WHY might you find a significant SEM but the LM test from the SAR is NOT significant? What does that suggest about our SAR model? (This is not a trick question, we’ve already discussed it.)
The SEM tests whether there is spatial autocorrelation within the error term, which differs from the SAR, which models spatial spillover in the target.
Our results suggest that the assumption of error term independence is violated.
The Spatial Durbin Model (SDM) extends the Spatial Lag Model (SAR) by incorporating spatial lags of both the target variable AND the predictors. The idea here is that there may be spatial autocorrelation not only in the target but also in some (if not all) of the predictors. For example, if there was a policy or some underlying demography that affected the pneumonia-related hospital readmissions of one state that ALSO influenced that of a neighboring state, and we had measured that as a predictor (e.g., poverty rates or the number of elderly in the population), then an SDM would be a good idea provided that you had a significant LM test.
The general structure of the SDM is:
\(y = \rho Wy + \{\beta_1 X_1 + ... +\beta_n X_n\} + \{W_1X_1\theta_1 + ... + W_nX_n\theta_n\} + \epsilon\)
where \(y\) is the target variable
(median pneumonia-related readmissions), \(W\) is the spatial weights matrix, which
defines the neighborhoods (stored in states_sw!), and \(\rho\) is the spatial lag coefficient,
making the first term of the equation, \(\rho
Wy\), the same as it was in SAR! The next term, \(\beta_1 X_1 + \beta_2 X_2 + ... \beta_n
X_n\) are just the standard regression coefficients for each of
the \(n\) predictors. Then comes the
third term, \(\{W_1X_1\theta_1 + ... +
W_nX_n\theta_n\}\), which represents EACH of the spatially-lagged
predictors we are including. Note that we would not have to spatially
lag all predictors. As before, \(\epsilon\) is an unmeasured error term.
So, just by looking at all that mathy-math, we can see this model is a lot more complicated! The advantage, however, is that it would allow us to account for spatial autocorrelation in the target and predictors simultaneously, which is immensely powerful. And I like power! (Statistical power, that is.)
Fitting the model is pretty easy using the lagsarlm()
function:
SDM <- lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain,
listw = states_sw, type = "mixed") ## "mixed" is necessary to specify as SDM
## Uncomment to view results!
summary(SDM)
##
## Call:lagsarlm(formula = bc_PredictedReadmissionRate ~ ., data = sarTrain,
## listw = states_sw, type = "mixed")
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5119296 -0.0856881 0.0028116 0.1329395 0.4811692
##
## Type: mixed
## Coefficients: (asymptotic standard errors)
## Estimate
## (Intercept) -0.753483
## MRSA.Bacteremia 0.181542
## Abdomen.CT.Use.of.Contrast.Material 0.044127
## Death.rate.for.pneumonia.patients 0.135721
## Postoperative.respiratory.failure.rate -1.430492
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 1.596331
## Composite.patient.safety -0.395633
## Medicare.spending.per.patient 0.841922
## Healthcare.workers.given.influenza.vaccination 0.104029
## Left.before.being.seen -0.025737
## Venous.Thromboembolism.Prophylaxis 0.399788
## Doctor.communication 0.335880
## Communication.about.medicines 0.391737
## Discharge.information -0.598922
## Cleanliness -0.501820
## Quietness -0.353278
## Hospital.return.days.for.pneumonia.patients 0.159403
## lag.MRSA.Bacteremia 1.556060
## lag.Abdomen.CT.Use.of.Contrast.Material 0.234453
## lag.Death.rate.for.pneumonia.patients 0.818625
## lag.Postoperative.respiratory.failure.rate -3.074017
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 2.572557
## lag.Composite.patient.safety 0.379481
## lag.Medicare.spending.per.patient 0.900775
## lag.Healthcare.workers.given.influenza.vaccination 0.146529
## lag.Left.before.being.seen -0.648498
## lag.Venous.Thromboembolism.Prophylaxis 1.813592
## lag.Doctor.communication -1.028270
## lag.Communication.about.medicines 0.476890
## lag.Discharge.information 1.418513
## lag.Cleanliness -0.607748
## lag.Quietness -0.821707
## lag.Hospital.return.days.for.pneumonia.patients 0.518370
## Std. Error
## (Intercept) 0.135373
## MRSA.Bacteremia 0.226457
## Abdomen.CT.Use.of.Contrast.Material 0.145975
## Death.rate.for.pneumonia.patients 0.183256
## Postoperative.respiratory.failure.rate 0.396346
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.227073
## Composite.patient.safety 0.194300
## Medicare.spending.per.patient 0.134421
## Healthcare.workers.given.influenza.vaccination 0.132836
## Left.before.being.seen 0.131627
## Venous.Thromboembolism.Prophylaxis 0.325253
## Doctor.communication 0.219560
## Communication.about.medicines 0.224948
## Discharge.information 0.195077
## Cleanliness 0.224143
## Quietness 0.165374
## Hospital.return.days.for.pneumonia.patients 0.126643
## lag.MRSA.Bacteremia 0.459492
## lag.Abdomen.CT.Use.of.Contrast.Material 0.332267
## lag.Death.rate.for.pneumonia.patients 0.504535
## lag.Postoperative.respiratory.failure.rate 0.793894
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.730970
## lag.Composite.patient.safety 0.642436
## lag.Medicare.spending.per.patient 0.363571
## lag.Healthcare.workers.given.influenza.vaccination 0.275065
## lag.Left.before.being.seen 0.380962
## lag.Venous.Thromboembolism.Prophylaxis 0.698519
## lag.Doctor.communication 0.439437
## lag.Communication.about.medicines 0.740221
## lag.Discharge.information 0.426730
## lag.Cleanliness 0.600338
## lag.Quietness 0.256628
## lag.Hospital.return.days.for.pneumonia.patients 0.417160
## z value
## (Intercept) -5.5660
## MRSA.Bacteremia 0.8017
## Abdomen.CT.Use.of.Contrast.Material 0.3023
## Death.rate.for.pneumonia.patients 0.7406
## Postoperative.respiratory.failure.rate -3.6092
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 7.0300
## Composite.patient.safety -2.0362
## Medicare.spending.per.patient 6.2633
## Healthcare.workers.given.influenza.vaccination 0.7831
## Left.before.being.seen -0.1955
## Venous.Thromboembolism.Prophylaxis 1.2292
## Doctor.communication 1.5298
## Communication.about.medicines 1.7415
## Discharge.information -3.0702
## Cleanliness -2.2388
## Quietness -2.1362
## Hospital.return.days.for.pneumonia.patients 1.2587
## lag.MRSA.Bacteremia 3.3865
## lag.Abdomen.CT.Use.of.Contrast.Material 0.7056
## lag.Death.rate.for.pneumonia.patients 1.6225
## lag.Postoperative.respiratory.failure.rate -3.8721
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 3.5194
## lag.Composite.patient.safety 0.5907
## lag.Medicare.spending.per.patient 2.4776
## lag.Healthcare.workers.given.influenza.vaccination 0.5327
## lag.Left.before.being.seen -1.7023
## lag.Venous.Thromboembolism.Prophylaxis 2.5963
## lag.Doctor.communication -2.3400
## lag.Communication.about.medicines 0.6443
## lag.Discharge.information 3.3241
## lag.Cleanliness -1.0123
## lag.Quietness -3.2019
## lag.Hospital.return.days.for.pneumonia.patients 1.2426
## Pr(>|z|)
## (Intercept) 0.000000026066269
## MRSA.Bacteremia 0.4227501
## Abdomen.CT.Use.of.Contrast.Material 0.7624312
## Death.rate.for.pneumonia.patients 0.4589309
## Postoperative.respiratory.failure.rate 0.0003071
## Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.000000000002065
## Composite.patient.safety 0.0417303
## Medicare.spending.per.patient 0.000000000376876
## Healthcare.workers.given.influenza.vaccination 0.4335467
## Left.before.being.seen 0.8449781
## Venous.Thromboembolism.Prophylaxis 0.2190116
## Doctor.communication 0.1260692
## Communication.about.medicines 0.0816043
## Discharge.information 0.0021393
## Cleanliness 0.0251663
## Quietness 0.0326603
## Hospital.return.days.for.pneumonia.patients 0.2081464
## lag.MRSA.Bacteremia 0.0007080
## lag.Abdomen.CT.Use.of.Contrast.Material 0.4804273
## lag.Death.rate.for.pneumonia.patients 0.1046895
## lag.Postoperative.respiratory.failure.rate 0.0001079
## lag.Perioperative.pulmonary.embolism.or.deep.vein.thrombosis.rate 0.0004326
## lag.Composite.patient.safety 0.5547274
## lag.Medicare.spending.per.patient 0.0132278
## lag.Healthcare.workers.given.influenza.vaccination 0.5942364
## lag.Left.before.being.seen 0.0887056
## lag.Venous.Thromboembolism.Prophylaxis 0.0094223
## lag.Doctor.communication 0.0192852
## lag.Communication.about.medicines 0.5194105
## lag.Discharge.information 0.0008869
## lag.Cleanliness 0.3113745
## lag.Quietness 0.0013651
## lag.Hospital.return.days.for.pneumonia.patients 0.2140089
##
## Rho: -0.3198, LR test value: 1.8085, p-value: 0.17868
## Asymptotic standard error: 0.18114
## z-value: -1.7655, p-value: 0.077472
## Wald statistic: 3.1171, p-value: 0.077472
##
## Log likelihood: 14.78912 for mixed model
## ML residual variance (sigma squared): 0.031295, (sigma: 0.1769)
## Number of observations: 49
## Number of parameters estimated: 35
## AIC: NA (not available for weighted model), (AIC for lm: 40.23)
## LM test for residual autocorrelation
## test value: 6.2317, p-value: 0.012548
Now, you might notice something a bit odd here. The test of \(H_0: \rho = 0\) is NOT significant (p = 0.17868), but the LM test for residual autocorrelation IS significant (p = 0.012548), which is a complete reversal. What gives?!
The reality is that we likely still have unaccounted-for spatial structure in the data, in the form of OMITTED VARIABLES. Yikes! How can we deal with that if we have no idea what we omitted?!
The SAR did a “good enough” job at inference. We’ve muddied the waters a bit by doing the SDM. Sometimes “good enough” is totally okay! However, don’t forget that for prediction purposes, the SAR model is not going to work.
There are very likely to be omitted spatial variables. We haven’t, for example, included anything about underlying demographics or Social Determinants of Health.
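If you wanted to formally weigh the SDM against the simpler SAR, two hedged options (assuming the earlier SAR fit is stored in an object named SAR, alongside the SDM and states_sw from above) are a likelihood-ratio comparison of the nested fits and spatialreg's impacts(), which splits each SDM predictor's effect into direct and indirect (spillover) components:

```r
## Uncomment to run; assumes SAR, SDM, and states_sw objects from above
# anova(SDM, SAR)                   # likelihood-ratio comparison of nested fits
# impacts(SDM, listw = states_sw)   # direct vs. indirect (spillover) effects
```

The impacts() step matters because, once \(\rho Wy\) is in the model, the raw SDM coefficients alone no longer summarize how a change in a predictor propagates through the neighborhood structure.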
If we were determined to improve this, we would either need to add the omitted spatially structured predictors or move to a model that also accounts for spatially correlated errors.
Briefly summarize a possible workflow, based on what we’ve done here as well as with help from other sources if needed, to use (1) Moran’s \(I\) and/or LISA, (2) SAR, (3) SEM, and (4) SDM. Table 8 (below) may help you, and you’re free to refer to it to save writing. Feel free to write this as a numbered or bulleted list, if you’d like!
- Use Moran’s I and/or LISA to test for global/local spatial autocorrelation in the target.
- Fit an OLS model to create an initial benchmark. Use Moran’s I/LISA on its residuals to check for spatial autocorrelation and proceed if present.
Use the respective models under the listed conditions:
- If spatial dependence is in the target variable, fit a SAR model. Test the significance of Rho to confirm the choice of model.
- If spatial dependence is in the residuals, fit an SEM. Test the significance of Lambda to confirm the choice of model.
- If spatial dependence is in both the target and the predictors, fit an SDM.
| Model | Spatial Dependence In | Use When |
|---|---|---|
| SAR (Spatial Lag/Autoregressive) | Target variable | Outcome in one area is influenced by neighboring outcomes |
| SEM (Spatial Error) | Residuals \(\epsilon\) | Unmeasured spatial effects or spatial error correlation |
| SDM (Spatial Durbin) | Both target and predictors | Capture spatial spillover of both outcomes and predictors |
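In practice, the choice among the rows of Table 8 is often made with Lagrange multiplier diagnostics run on the OLS benchmark. A hedged sketch (assuming an OLS fit named ols_fit on sarTrain; note that newer spatialreg releases rename this function lm.RStests()):

```r
## Uncomment to run; assumes states_sw from above
# ols_fit <- lm(bc_PredictedReadmissionRate ~ ., data = sarTrain)
# lm.LMtests(ols_fit, listw = states_sw,
#            test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
## Significant LMlag/RLMlag -> lean toward SAR; significant LMerr/RLMerr -> lean toward SEM
```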
DATA SCIENCE EXTENSIONS OF SPATIAL MODELING \(\rightarrow\) At this point, it should feel clear that we have spatial influence in our data with state-level differences in pneumonia-related hospital readmissions. Yet, building a robust predictive model is still not within our grasp. How could we get there?
Our next steps will involve moving in the direction of adding omitted spatially correlated features, but your first task is a thought experiment and a little research.
Do data scientists ever extract the spatially lagged targets/predictors and put them into machine learning models, like Elastic Net - or others like Random Forest or Neural Networks? Make sure to give an example from a scholarly or journalistic article.
HINT 1: Searching Google Scholar can be a starting point for academic articles.
HINT 2: The Harvard Data Science Review can be a solid starting point for articles that are more hybrid academic/journalistic.
HINT 3: Medium or LinkedIn can sometimes have decent how-to guides, but these articles are NOT validated in any way. If you choose an article from here, make sure you validate it with what we’ve done here and discussed in class!
Yes, it appears as if this practice is rather common in cases with spatial autocorrelation.
One example can be seen in the article “Spatiotemporal lagging of predictors improves machine learning estimates of atmosphere–forest CO2 exchange” (Kämäräinen et al., 2023). In this article, the researchers were looking to build models to predict carbon fluxes. They incorporated spatially lagged predictors in random forest and gradient boosting models; these predictors included environmental factors like air temperature and soil moisture measured at neighboring sites. The study demonstrated that incorporating such predictors increased the models’ accuracy.
From this, we can feel validated in our choice to include a spatially lagged predictor in our model for pneumonia readmissions.
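The feature-engineering step behind that practice is easy to sketch: compute the spatial lag of each predictor with lag.listw() and bind the lagged columns onto the training data before handing it to an ML algorithm. (This sketch assumes the sarTrain data frame and states_sw weights from above; the mlTrain name is just illustrative.)

```r
## Uncomment to run; assumes sarTrain and states_sw from above
# preds  <- setdiff(names(sarTrain), "bc_PredictedReadmissionRate")
# lagged <- as.data.frame(lapply(sarTrain[preds],
#                                function(x) lag.listw(states_sw, x)))
# names(lagged)  <- paste0("lag.", names(lagged))
# mlTrain <- cbind(sarTrain, lagged)   # ready for Elastic Net, Random Forest, etc.
```

This mirrors what lagsarlm(type = "mixed") did internally, the difference being that here the lagged columns become ordinary features for any model you like.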
Source:
Kämäräinen, M., Tuovinen, J.-P., Kulmala, M., Mammarella, I., Aalto, J., Vekuri, H., Lohila, A., and Lintunen, A.: Spatiotemporal lagging of predictors improves machine learning estimates of atmosphere–forest CO2 exchange, Biogeosciences, 20, 897–909, https://doi.org/10.5194/bg-20-897-2023, 2023.
DATA SCIENCE EXTENSIONS OF SPATIAL MODELING \(\rightarrow\) Why is the state-level spatial dataset inappropriate for any further machine-learning? In other words, why are we currently restricted with the dataset exactly as it is to just the SAR, SEM, or SDM?
HINT: Check the dimensions of sarTrain
if you need to!
dim(sarTrain)
## [1] 49 17
The training data are severely limited for the purposes of model building. 49 rows is a very small sample size for building predictive models; so few training examples will likely lead to models that are overfit and therefore cannot generalize beyond those examples.
SAR, SEM, and SDM work well enough with this sample size and are being used here to test for spatial dependence. Other machine-learning methods, however, will struggle with so few observations.
DATA SCIENCE EXTENSIONS OF SPATIAL MODELING \(\rightarrow\) Is there another spatial structure to this dataset that we could explore for use with machine learning instead of state-level?
Working at the level of county, city, or zip code could provide for many more rows of data than the state-level aggregates.
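As a rough sense of how much more data a finer geography could buy us, the maps package we used at the start of this lab can count counties. (This counts map polygons, not readmissions records, so it is only a ballpark for the number of rows a county-level dataset might have.)

```r
library(ggplot2)   # map_data() wraps the maps package

counties <- map_data("county")
## region = state, subregion = county, so state+county pairs identify counties
length(unique(paste(counties$region, counties$subregion)))
## roughly 3,000 counties, versus the 49 state rows we have now
```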